10 research outputs found
Towards Semi-Supervised Learning for Deep Semantic Role Labeling
Neural models have achieved state-of-the-art performance on Semantic Role
Labeling (SRL). However, these models require an immense amount of
semantic-role-annotated corpora and are thus not well suited to low-resource
languages or domains. This paper proposes a semi-supervised semantic role
labeling method that outperforms the state of the art when SRL training
corpora are limited. The
method is based on explicitly enforcing syntactic constraints by augmenting the
training objective with a syntactic-inconsistency loss component and uses
SRL-unlabeled instances to train a joint-objective LSTM. On the CoNLL-2012
English section, the proposed semi-supervised training with 1% and 10%
SRL-labeled data and varying amounts of SRL-unlabeled data achieves +1.58 and
+0.78 F1, respectively, over pre-trained models trained on a state-of-the-art
architecture with ELMo on the same SRL-labeled data. Additionally, by applying
the syntactic-inconsistency loss at inference time, the proposed model achieves
+3.67 and +2.1 F1 over the pre-trained model on 1% and 10% SRL-labeled data,
respectively. Comment: EMNLP 201
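The joint objective above — a supervised SRL loss on labeled data plus a syntactic-inconsistency penalty on unlabeled data — can be sketched as a toy in NumPy. This is illustrative, not the paper's implementation: the particular penalty here (probability mass on role predictions marked syntactically inconsistent by a hypothetical `span_mask` from a parser) and the weighting `lam` are assumptions.

```python
import numpy as np

def srl_cross_entropy(probs, gold):
    # supervised loss: mean negative log-likelihood of the gold role labels
    return float(-np.mean(np.log(probs[np.arange(len(gold)), gold] + 1e-12)))

def syntactic_inconsistency(probs, span_mask):
    # hypothetical penalty: average probability mass placed on role
    # predictions flagged as syntactically inconsistent (span_mask == 0)
    return float(np.mean(probs * (1.0 - span_mask)))

def joint_objective(labeled, unlabeled, lam=0.5):
    probs_l, gold = labeled    # SRL-labeled batch: role probs + gold labels
    probs_u, mask = unlabeled  # SRL-unlabeled batch: role probs + parse mask
    return (srl_cross_entropy(probs_l, gold)
            + lam * syntactic_inconsistency(probs_u, mask))
```

The unlabeled term only requires syntactic parses, not SRL annotations, which is what lets SRL-unlabeled instances contribute to training; setting `lam` to zero recovers purely supervised training.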
Gradient-based Inference for Networks with Output Constraints
Practitioners apply neural networks to increasingly complex problems in
natural language processing, such as syntactic parsing and semantic role
labeling, which have rich output structures. Many such structured-prediction
problems require deterministic constraints on the output values; for example,
in sequence-to-sequence syntactic parsing, we require that the sequential
outputs encode valid trees. While hidden units might capture such properties,
the network is not always able to learn such constraints from the training data
alone, and practitioners must then resort to post-processing. In this paper, we
present an inference method for neural networks that enforces deterministic
constraints on outputs without performing rule-based post-processing or
expensive discrete search. Instead, in the spirit of gradient-based training,
we enforce constraints with gradient-based inference (GBI): for each input at
test-time, we nudge continuous model weights until the network's unconstrained
inference procedure generates an output that satisfies the constraints. We
study the efficacy of GBI on three tasks with hard constraints: semantic role
labeling, syntactic parsing, and sequence transduction. In each case, the
algorithm not only satisfies constraints but improves accuracy, even when the
underlying network is state-of-the-art. Comment: AAAI 201
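The GBI loop described above can be sketched in a few lines. This toy stands in for the paper's setting: it uses a linear model and an "outputs must sum to one" constraint rather than structured constraints like tree validity, and `violation`/`grad_violation` are illustrative stand-ins for a task-specific constraint loss and its gradient.

```python
import numpy as np

def gbi(W, x, violation, grad_violation, lr=0.1, max_steps=100, tol=1e-4):
    """Gradient-based inference (sketch): for one test input, nudge a copy of
    the weights until the unconstrained forward pass satisfies the constraint."""
    W = W.copy()                              # per-input copy; model untouched
    for _ in range(max_steps):
        y = W @ x                             # ordinary unconstrained inference
        if violation(y) < tol:                # constraint satisfied: stop nudging
            break
        W -= lr * grad_violation(W, x, y)     # step weights toward satisfaction
    return W @ x

# Toy constraint: outputs must sum to 1 (e.g. encode a valid distribution).
violation = lambda y: (y.sum() - 1.0) ** 2
grad_violation = lambda W, x, y: 2.0 * (y.sum() - 1.0) * np.outer(np.ones_like(y), x)
```

Note that only a temporary copy of the weights is perturbed, per input, so the deployed model itself is never modified.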
An Introduction to Lifelong Supervised Learning
This primer is an attempt to provide a detailed summary of the different
facets of lifelong learning. We start with Chapter 2 which provides a
high-level overview of lifelong learning systems. In this chapter, we discuss
prominent scenarios in lifelong learning (Section 2.4), provide
a high-level organization of different lifelong learning approaches (Section
2.5), enumerate the desiderata for an ideal lifelong learning system (Section
2.6), discuss how lifelong learning is related to other learning paradigms
(Section 2.7), and describe common metrics used to evaluate lifelong learning
systems (Section 2.8). This chapter is more useful for readers who are new to
lifelong learning and want to get introduced to the field without focusing on
specific approaches or benchmarks. The remaining chapters focus on specific
aspects (either learning algorithms or benchmarks) and are more useful for
readers who are looking for specific approaches or benchmarks. Chapter 3
focuses on regularization-based approaches that do not assume access to any
data from previous tasks. Chapter 4 discusses memory-based approaches that
typically use a replay buffer or an episodic memory to save a subset of data
across different tasks. Chapter 5 focuses on different architecture families
(and their instantiations) that have been proposed for training lifelong
learning systems. Following these different classes of learning algorithms, we
discuss the commonly used evaluation benchmarks and metrics for lifelong
learning (Chapter 6) and wrap up with a discussion of future challenges and
important research directions in Chapter 7. Comment: Lifelong Learning Prime
Making Scalable Meta Learning Practical
Despite its flexibility to learn diverse inductive biases in machine learning
programs, meta learning (i.e., learning to learn) has long been recognized to
suffer from poor scalability due to its tremendous compute/memory costs,
training instability, and a lack of efficient distributed training support. In
this work, we focus on making scalable meta learning practical by introducing
SAMA, which combines advances in both implicit differentiation algorithms and
systems. Specifically, SAMA is designed to flexibly support a broad range of
adaptive optimizers in the base level of meta learning programs, while reducing
computational burden by avoiding explicit computation of second-order gradient
information, and exploiting efficient distributed training techniques
implemented for first-order gradients. Evaluated on multiple large-scale meta
learning benchmarks, SAMA showcases up to a 1.7x/4.8x increase in throughput
and a 2.0x/3.8x decrease in memory consumption on single-/multi-GPU setups,
respectively, compared to other baseline meta learning algorithms. Furthermore, we
show that SAMA-based data optimization leads to consistent improvements in text
classification accuracy with BERT and RoBERTa large language models, and
achieves state-of-the-art results in both small- and large-scale data pruning
on image classification tasks, demonstrating the practical applicability of
scalable meta learning across language and vision domains.
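To illustrate the bilevel structure that SAMA targets — without its implicit-differentiation machinery — here is a toy data-weighting meta-learner that approximates the meta-gradient with finite differences. This is far slower than SAMA's approach but is second-order-free in the same spirit: no explicit second-order gradient information is ever computed. The setup (per-example weights on a weighted least-squares inner problem) and all names are illustrative assumptions.

```python
import numpy as np

def inner_train(lam, X, y, steps=50, lr=0.1):
    # inner level: fit w by gradient descent on the lam-weighted MSE
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        r = X @ w - y
        w -= lr * (X.T @ (lam * r)) / len(y)
    return w

def val_loss(w, Xv, yv):
    return float(np.mean((Xv @ w - yv) ** 2))

def meta_step(lam, X, y, Xv, yv, meta_lr=0.5, eps=1e-3):
    # outer level: finite-difference meta-gradient w.r.t. the data weights,
    # sidestepping explicit second-order terms (implicit-differentiation
    # methods like SAMA achieve this far more efficiently)
    g = np.zeros_like(lam)
    for i in range(len(lam)):
        lp, lm = lam.copy(), lam.copy()
        lp[i] += eps
        lm[i] -= eps
        g[i] = (val_loss(inner_train(lp, X, y), Xv, yv)
                - val_loss(inner_train(lm, X, y), Xv, yv)) / (2 * eps)
    return np.clip(lam - meta_lr * g, 0.0, None)  # keep data weights nonnegative
```

Run on a tiny dataset with one corrupted label, a few meta steps drive that example's weight toward zero — the kind of data-optimization outcome the abstract reports at scale.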
Efficient Lifelong Learning in Deep Neural Networks: Optimizing Architecture, Training, and Data
The prevalent machine learning paradigm involves training a separate model for every new task given a static dataset. In contrast, humans accumulate knowledge over time, and the lifelong learning paradigm seeks to emulate this process by enabling systems to learn continuously from a stream of tasks, retaining past knowledge for efficient future learning. This paradigm also offers advantages such as avoiding periodic model training, potentially reducing computational and energy requirements, and promoting environmentally friendly Green AI. In modern machine learning, deep neural networks, while powerful, face challenges like catastrophic forgetting (losing knowledge from previous tasks during new task learning) and negative interference (previously learned knowledge hindering new task learning). These issues arise from the stability-plasticity dilemma, which necessitates finding the right balance between preserving past knowledge (stability) and acquiring new knowledge (plasticity). Efficient lifelong learning systems must address this dilemma, along with other considerations like supporting online data streams, utilizing small and fixed memory buffer capacity (if any), and learning from unlabeled data streams.
In this thesis, we derive inspiration from the biological learning process and recent progress in deep learning to enable efficient lifelong learning systems. We propose injecting inductive biases into the three main components of data-driven machine learning: model (architecture & initialization), training (objective & optimization), and data. This thesis is structured into three parts, each corresponding to one of these components. In the first part, we explore the role of pre-trained initializations, revealing their implicit alleviation of forgetting compared to random ones. Next, we design a parameter-efficient expert architecture that dynamically expands learning capacity to address the stability-plasticity dilemma. In the second part, we demonstrate that explicit optimization for flat minima improves network stability and introduce a meta-learning objective for stability-plasticity balance. The third part delves into lifelong semi-supervised learning, addressing the stability-plasticity dilemma by rehearsing pseudo-labeled data. We conclude by examining pre-training from the perspective of lifelong learning, showcasing enhancements by applying the above-developed strategies to the (continual) pre-training of models.
Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models
Model compression by way of parameter pruning, quantization, or distillation
has recently gained popularity as an approach for reducing the computational
requirements of modern deep neural network models for NLP. Pruning unnecessary
parameters has emerged as a simple and effective method for compressing large
models that is compatible with a wide variety of contemporary off-the-shelf
hardware (unlike quantization), and that requires little additional training
(unlike distillation). Pruning approaches typically take a large, accurate
model as input, then attempt to discover a smaller subnetwork of that model
capable of achieving end-task accuracy comparable to the full model. Inspired
by previous work suggesting a connection between simpler, more generalizable
models and those that lie within flat basins in the loss landscape, we propose
to directly optimize for flat minima while performing task-specific pruning,
which we hypothesize should lead to simpler parameterizations and thus more
compressible models. In experiments combining sharpness-aware minimization with
both iterative magnitude pruning and structured pruning approaches, we show
that optimizing for flat minima consistently leads to greater compressibility
of parameters compared to standard Adam optimization when fine-tuning BERT
models, leading to higher rates of compression with little to no loss in
accuracy on the GLUE classification benchmark. Comment: 12 page
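A minimal sketch of the recipe — a sharpness-aware update followed by magnitude pruning — might look like this in NumPy. The perturbation radius `rho`, the one-shot pruning step, and the toy quadratic objective in the test are assumptions for illustration, not the paper's BERT fine-tuning setup.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step (sketch): ascend to the worst-case
    point within an L2 ball of radius rho, then descend with the gradient there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # adversarial weight perturbation
    return w - lr * grad_fn(w + eps)             # update using perturbed gradient

def magnitude_prune(w, sparsity=0.5):
    # zero out the smallest-magnitude `sparsity` fraction of the weights
    k = int(len(w) * sparsity)
    idx = np.argsort(np.abs(w))[:k]
    w = w.copy()
    w[idx] = 0.0
    return w
```

Training with `sam_step` rather than a plain gradient step biases the solution toward flat basins, which is the property the abstract connects to greater compressibility under `magnitude_prune`.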
An Empirical Investigation of the Role of Pre-training in Lifelong Learning
The lifelong learning paradigm in machine learning is an attractive
alternative to the more prominent isolated learning scheme, not only due to its
resemblance to biological learning but also due to its potential to reduce
energy waste by obviating excessive model re-training. A key challenge to this
paradigm is the phenomenon of catastrophic forgetting. With the increasing
popularity and success of pre-trained models in machine learning, we pose the
question: What role does pre-training play in lifelong learning, specifically
with respect to catastrophic forgetting? We investigate existing methods in the
context of large, pre-trained models and evaluate their performance on a
variety of text and image classification tasks, including a large-scale study
using a novel data set of 15 diverse NLP tasks. Across all settings, we observe
that generic pre-training implicitly alleviates the effects of catastrophic
forgetting when learning multiple tasks sequentially compared to randomly
initialized models. We then further investigate why pre-training alleviates
forgetting in this setting. We study this phenomenon by analyzing the loss
landscape, finding that pre-trained weights appear to ease forgetting by
leading to wider minima. Based on this insight, we propose jointly optimizing
for current task loss and loss basin sharpness to explicitly encourage wider
basins during sequential fine-tuning. We show that this optimization approach
outperforms several state-of-the-art task-sequential continual learning
algorithms across multiple settings, occasionally even without retaining a
memory that scales in size with the number of tasks.